Introduction to Open Data Science - Course Project

About the project

This is my (Hi, I’m Eirik) project page for course work related to the PHD-302 Introduction to Open Data Science course, hosted at the University of Helsinki, autumn 2022.

Assignment 1 tasks:

Link to the GitHub repository: github.com/Eiriksen/IODS-project

Thoughts about this course right now

I found out about it from the doctoral school’s mailing lists, and signed up because I need the study credits and hoped to learn more about general open data science concepts, especially data collaboration and using git. R I am already very familiar with, so I don’t expect to learn much new on the technical side, but I’m hoping to get some perspectives on how to use it in collaborative projects (where others have to read your code).

Thoughts about Ch 4-5 of R for Health Data Science and the following exercises:

I mostly skimmed through these chapters and the exercises; I’ve already spent more hours than I dare to admit coding in R using tidyverse, dplyr, and ggplot, so I didn’t find much here that I was not familiar with. About the book: I’m not sure it uses the best practice for getting people started with R. It seems to be written for someone who has never touched R or any kind of coding/scripting before, and yet it completely skips the very basics of understanding how R works, instead having the user jump straight into RStudio and installing tidyverse, something I think might create a lot of confusion. Hopefully others attending this course have some experience with R beforehand.


Chapter 2. Analysis of the Learning 2014 dataset

1) Load, explore, explain

library(tidyverse)
library(GGally)

tb_students <- read_delim("data/data-learning2014.txt")
str(tb_students)
## spec_tbl_df [166 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ gender  : chr [1:166] "F" "M" "F" "M" ...
##  $ age     : num [1:166] 53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num [1:166] 37 31 25 35 37 38 35 29 38 21 ...
##  $ deep    : num [1:166] 3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num [1:166] 3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num [1:166] 2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : num [1:166] 25 12 24 10 22 21 21 31 24 26 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   gender = col_character(),
##   ..   age = col_double(),
##   ..   attitude = col_double(),
##   ..   deep = col_double(),
##   ..   stra = col_double(),
##   ..   surf = col_double(),
##   ..   points = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

The dataset is the result of a survey of participants in the course Johdatus yhteiskuntatilastotieteeseen (Introduction to Social Statistics), fall 2014, and surveys the participants’ approaches to learning (ASSIST, section B) and attitudes toward statistics (based on SATS, https://www.evaluationandstatistics.com/scoring), as well as their points from their exams.

Info about the study: Metadata 1 Metadata 2 PPT presentation

183 of the course’s 250 students participated in the study. However, due to some missing data, there are 166 participants in the final dataset.

As we can see from str(tb_students) above, the dataset has 166 observations of 7 variables. Each row is one participant, and for each participant we know their age (numeric), gender (F/M), exam points (numeric), and their scores (numeric) on four dimensions: attitude towards statistics (attitude), deep learning approach (deep), surface learning approach (surf), and strategic learning approach (stra). These scores are based on several questions in the questionnaire where the students were asked to rate (from 1 to 5) how much they agreed with certain statements, for example “I usually set out to understand for myself the meaning of what we have to learn”.

2) Graphical overview, summary and comments

Let’s have a quick look at some summaries of the data:

ggpairs(
      data = tb_students %>% relocate(-gender),
      lower = list(continuous = wrap("points", alpha = 0.3, size=0.3), mapping=aes(color=gender)),
      )+
  scale_color_manual(values=c("F"="RED","M"="BLUE"))+
  scale_fill_manual(values=c("F"="RED","M"="BLUE"))

Blue represents male participants, and red female participants.

This plot shows us the distribution of the different variables (diagonal), the plotted relationships between the variables (lower triangle), and the measured correlations between them (upper triangle). Correlations are not shown for gender since it’s a categorical variable.

From the top left plot we can see that most of the participants are aged around 20-30 years, but there are also some older participants, up to slightly above 50 years. From the other diagonal panels we can see that most variables are fairly normally distributed.

There are also far more female than male participants (bottom right plot). We can get some more details on that using a summary table:

tb_students %>% group_by(gender) %>% summarise(n=n(), p=n() / nrow(.))
## # A tibble: 2 × 3
##   gender     n     p
##   <chr>  <int> <dbl>
## 1 F        110 0.663
## 2 M         56 0.337

This shows us that 66% (n=110) of the participants were female, and only 34% (n=56) male. However, looking at the rightmost column of the plot above, we can see that most variables don’t seem to associate strongly with gender, except maybe attitude, where the male participants seem to score slightly higher. This can also be seen in the second plot of the bottom row, where a larger share of the female participants have a low attitude score.

Looking at the correlations, the strongest one (***) seems to be between points and attitude, where participants with a higher attitude-score seems to score more points at the exams. Surface learning also seems to correlate negatively with deep learning and strategic learning, as well as correlating negatively with attitude.

3) Fitting a regression model, summary, and comment statistical tests

Let’s fit a regression model to examine some of the patterns more closely. We’ll use the variables attitude, age, and gender.

model_full <- lm(points ~ attitude + age + gender, data=tb_students)
summary(model_full)
## 
## Call:
## lm(formula = points ~ attitude + age + gender, data = tb_students)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.4590  -3.3221   0.2186   4.0247  10.4632 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.42910    2.29043   5.863 2.48e-08 ***
## attitude     0.36066    0.05932   6.080 8.34e-09 ***
## age         -0.07586    0.05367  -1.414    0.159    
## genderM     -0.33054    0.91934  -0.360    0.720    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.315 on 162 degrees of freedom
## Multiple R-squared:  0.2018, Adjusted R-squared:  0.187 
## F-statistic: 13.65 on 3 and 162 DF,  p-value: 5.536e-08

It seems here that only attitude has a significant association with exam points, as the t-tests of the slope estimates only give a p-value below 0.05 for attitude. This statistical test checks whether the slope (estimate) of an explanatory variable is significantly different from 0. The idea is that even if the true slope is 0, the estimated slope will randomly deviate from 0 following a t-distribution. Using the t-test we can check whether the estimated slope is more different from 0 than would be likely under that t-distribution.

In our case, only the slope for attitude is significantly different from zero; If age or gender has an effect on exam points, we’re not able to detect it here.
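As a quick sanity check, the t-value and p-value reported by summary() can be recomputed by hand from the coefficient table (a sketch, assuming model_full from the chunk above is still in memory):

```r
# t-value = estimate / standard error; p-value = two-sided tail
# probability under the t-distribution with the model's residual df
cf <- coef(summary(model_full))
t_attitude <- cf["attitude", "Estimate"] / cf["attitude", "Std. Error"]
p_attitude <- 2 * pt(abs(t_attitude), df = df.residual(model_full),
                     lower.tail = FALSE)
```

Both values should match the attitude row of the summary table above.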

We’ll simplify the model by removing the non-significant explanatory variables:

model_simple <- lm(points ~ attitude, data=tb_students)

4) Summary of simplified model, explain relationship, explain multiple R-squared

summary(model_simple)
## 
## Call:
## lm(formula = points ~ attitude, data = tb_students)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.63715    1.83035   6.358 1.95e-09 ***
## attitude     0.35255    0.05674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

Our estimated slope for attitude is 0.35. This means that for each extra point on the attitude score, the expected number of exam points goes up by 0.35. The multiple R-squared value (coefficient of determination) of 0.19 tells us that our explanatory variable attitude explains 19% of the variation in the target variable exam points.
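The multiple R-squared can also be recomputed directly from its definition (a sketch, assuming model_simple and tb_students are still in memory):

```r
# R-squared = 1 - SS_residual / SS_total
ss_res <- sum(residuals(model_simple)^2)
ss_tot <- sum((tb_students$points - mean(tb_students$points))^2)
r_squared <- 1 - ss_res / ss_tot
# With a single predictor this equals the squared correlation
# between the explanatory and the target variable:
all.equal(r_squared, cor(tb_students$attitude, tb_students$points)^2)
```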

5) Diagnostic plots

A key assumption for a basic linear model is that the residuals of the observed variable (points) are normally distributed.

There are many ways of checking this; two are shown below: 1) just plot a histogram of the residuals, 2) plot a q-q plot. These two plots basically tell us the same thing, but the q-q plot makes it a little easier to see how much our residuals stray from normal.

par(mfrow=c(1,2))
hist(residuals(model_simple),breaks=50)
plot(model_simple,which=2)

Our histogram and Q-Q plot tells us that our residuals are fairly normally distributed, but there are some extreme values towards the negative end.

Side note: Personally, I initially found it a little tricky to wrap my head around what is meant by “standardised residuals” and “theoretical quantiles”, so I did some googling and tried making the q-q plot from scratch. I understand it now, but it’s actually quite tricky to explain in words. Anyways, here’s the R code for a q-q plot from scratch:

# ppoints() gives evenly spaced probabilities strictly inside (0,1), avoiding
# the infinite quantiles that qnorm() returns at p=0 and p=1
plot(qnorm(ppoints(nrow(tb_students))), sort(scale(residuals(model_simple))))
abline(coef=c(0,1))

Another important assumption of the linear model is homoscedasticity (as opposed to heteroscedasticity), which means that the variance of our residuals stays the same independent of the explanatory variable (there should not be higher variance among students with a high attitude). We can check this with a residuals vs fitted plot:

par(mfrow=c(1,2))
plot(model_simple,which=1)
plot(tb_students$attitude, residuals(model_simple))
abline(coef=c(0,0))

The plot shows us that the residuals stay more or less the same independent of the fitted values (explanatory variable), which is good! It could look like there’s less variance among the students with a high attitude score, but this is likely just because there are fewer students there, so it is probably not a problem.

Also, to help understand the content of the plot, I’ve plotted the same plot twice, once using the built-in residuals vs fitted plot (left), and once by making the plot from “scratch” (right).

Finally, we might want to check for extreme outliers. Outliers are data points with an extremely high residual value. Data points with high leverage are points that have a high influence on the model. Points that have both a high residual and high leverage are potentially bad for the accuracy of our regression.

plot(model_simple,which=5)

In our case, it looks like there are no problematic points, as none of them are outside of Cook’s distance, which is so far away here that it is not even shown on the plot.
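The same information can be read off numerically: cooks.distance() returns the Cook’s distance for every observation (a sketch using model_simple from above; a common rule of thumb flags values above 1, or above 4/n for a stricter check):

```r
cd <- cooks.distance(model_simple)
sort(cd, decreasing = TRUE)[1:3]  # the three most influential points
```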

Overall, it looks like our model fits reasonably well, although we should keep the negatively skewed residuals in mind.


Chapter 3. Analysis of the student alcohol consumption dataset

2: Read and describe data

library(tidyverse)
library(glue)
tb_alc <- read_delim("data/data-alc.txt")
print(names(tb_alc))
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "guardian"   "traveltime" "studytime"  "schoolsup" 
## [16] "famsup"     "activities" "nursery"    "higher"     "internet"  
## [21] "romantic"   "famrel"     "freetime"   "goout"      "Dalc"      
## [26] "Walc"       "health"     "failures_m" "paid_m"     "absences_m"
## [31] "G1_m"       "G2_m"       "G3_m"       "failures_p" "paid_p"    
## [36] "absences_p" "G1_p"       "G2_p"       "G3_p"       "alc_use"   
## [41] "high_use"   "failures"   "absences"   "G1"         "G2"        
## [46] "G3"

The dataset is a product of questionnaires handed to students of two Portuguese schools, combined with course outcomes for two courses (math and Portuguese). G1, G2, and G3 refer to the grades in the first period, the second period, and the final grade. In this set, the values G1, G2, G3, failures, and absences are the averages of the scores over the two courses (for each student). The questionnaire questions relate to demographic, social, and school-related data. A full description of each variable and the dataset is available from this link. Of special interest in this analysis are the variables Walc and Dalc, which are scores from 1 (low) to 5 (very high) representing alcohol usage on the weekend (Walc) and on workdays (Dalc).

In this dataset we have also defined two new variables: alc_use (the average of Walc and Dalc), and high_use (whether alc_use is higher than 2 or not).
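As a sketch of how those two columns could be (re)derived from the Walc and Dalc columns described above:

```r
tb_alc <- tb_alc %>%
  mutate(
    alc_use  = (Walc + Dalc) / 2,  # mean of weekend and workday use
    high_use = alc_use > 2         # TRUE when the average exceeds 2
  )
```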

3: Hypothesis on relationship to alcohol use

Below are four variables I think will have some relationship with high alcohol use.

goout (negative correlation) This variable represents how much students go out with friends. Now, this might fail miserably, but my hypothesis is that students who often go out with friends have a lower chance of being high-alcohol users; they might drink some when they are out with friends, but going out with friends often might also correlate with a stronger social network, which can be protective against heavy alcohol use. Hypothesis: lower high_use with higher goout.

sex (some correlation) A lot of traits and social aspects differ by sex in humans (whether by culture or biology), so I’ll hypothesise that the number of high-alcohol users is going to be different for the two sexes (not hypothesising about the direction of the relationship).

studytime (negative correlation) My guess is that high alcohol use is symptomatic of other problems in a person’s life: students in an overall bad situation might have higher alcohol usage and also less focus on studying. My hypothesis is thus that a lot of study time gives a lower chance of high alcohol use.

famrel (negative correlation) Going by the same argument as above: family relations are important, and students with bad family relations might seek relief in alcohol usage. My hypothesis is that bad family relations increase the probability of high alcohol use.

4: Exploring these relations

One way to explore these data is to plot the proportion of high alcohol users for different levels of our variables in question (goout, sex, studytime, and famrel).

However, by only looking at the proportion, we’re missing out on some important information: how many students there are in each category. This tells us how much weight we can put on the proportion for each value.

To solve this, we’ll make plots showing both the proportion of high alcohol users for different levels of our variables, but also plot the distribution (histogram) of our variable in question.

To make this process a little cleaner, we’ll write a general function for making this plot (based on the variable in question, “var”):

# Set up a general function to make these plots

makeplot <- function(var){
  # Make a table where each row is one value of our explanatory variable...
  # ... and the column "p_high_use" says the proportion of high alcohol users...
  # ... there are in this this group
  tb_alc_grouped <- 
    tb_alc %>%
    group_by(!!var) %>%
    summarise(p_high_use=sum(high_use)/n())
  
  # find the max y value for the histogram we're going to be plotting,
  # We'll use that as a scaling factor for the second y axis of the plot
  ymax <- tb_alc %>% group_by(!!var) %>% summarise(n=n()) %>% pull(n) %>% max()
  scale = ymax
  
  # make the plot, drawing both the histogram of our variable in question (x), as well as a line showing the 
  # - proportion of high alcohol users for each value of x
  ggplot()+
    labs(title=glue("Distribution (bars) of {as.character(var)[2]} and the proportion (line) of \n high alcohol users in each category"))+
    geom_bar(data=tb_alc, aes(x=!!var))+
    geom_point(data=tb_alc_grouped,aes(x=!!var, y=p_high_use*scale),col="red",size=3)+
    scale_y_continuous(
      name = "N (bars)", sec.axis = sec_axis(~./(scale/100), name="% High alcohol use (line)")
    )
}

Then we can make the plots:

makeplot(quo(goout))

This plot seems to go against my initial hypothesis. It seems like there’s a higher proportion of high alcohol users among those who go out a lot. There are quite a bit fewer students going out a lot (4-5) than going out less (1-3), but probably not few enough to drive any extreme relationships; also, the pattern is quite strong. Based on this graph I’m guessing my hypothesis was not right, but let’s see what the analysis says later.

makeplot(quo(sex))

This plot seems to be in line with my hypothesis that the number of high alcohol users would be different between the sexes. It looks like there are more high alcohol users among the male students than among the female students. There are slightly more female students than male students, but the difference is not large.

makeplot(quo(studytime))

This also seems to support my hypothesis that students who spend a lot of time studying have low frequencies of high alcohol use. There are also relatively few students who study a lot.

makeplot(quo(famrel))

This one is interesting: it seems like students with poor family relations do have a higher incidence of high alcohol use, except for when the family relations are really poor (famrel=1). However, note that the sample sizes for famrel < 3 are also very low, so it is possible that the small sample sizes are driving more extreme (or “random”) values for the incidence of high alcohol use. I would therefore be a little sceptical of drawing conclusions from this graph.

5: Logistic regression

We’ll start by fitting a generalized linear model using our variables and a logit link function (family=“binomial”):

model = glm( high_use ~ goout + sex + studytime + famrel, family="binomial", data=tb_alc)

To get a better idea of the parameter estimates and their confidence intervals, we’ll print those out after having done an exponential transformation on them (transforming log odds ratios to odds ratios):

exp(confint(model))
## Waiting for profiling to be done...
##                  2.5 %    97.5 %
## (Intercept) 0.06545486 1.1622100
## goout       1.73873662 2.8198312
## sexM        1.31841735 3.7630180
## studytime   0.43751933 0.8576353
## famrel      0.49791884 0.8636137
summary(model)$coefficients %>% as.data.frame() %>% mutate(Estimate=exp(Estimate), `Pr(>|z|)`=round(`Pr(>|z|)`,5))
##              Estimate Std. Error   z value Pr(>|z|)
## (Intercept) 0.2816141  0.7312170 -1.733025  0.08309
## goout       2.1974324  0.1230422  6.398534  0.00000
## sexM        2.2165443  0.2669277  2.981891  0.00286
## studytime   0.6179355  0.1710716 -2.813859  0.00490
## famrel      0.6575181  0.1399361 -2.996245  0.00273

It seems like all our selected variables associate significantly with high alcohol usage (at \(\alpha=0.05\)), since the p-value for each variable is lower than 0.05.

Going through each variable:

Time spent going out. This variable is associating positively (opposite of my hypothesis, oh well), so more time spent going out means a higher probability of having a high alcohol usage. Specifically, for each point (from 1 to 5, from very low to very high) on self-reported time spent going out, the students have a 2.2 times [95% CI: 1.74-2.82] higher odds of having a high alcohol consumption.

Sex. Male-identifying students have a 2.2 times [95% CI: 1.32-3.76] higher odds of having a high alcohol consumption compared to female-identifying students. Male students drink more than female students (supporting my hypothesis that male and female students have different levels of alcohol consumptions)

Studytime. For each point (from 1 to 5, going from less than two hours to more than 10 hours) of self reported time spent studying, the students have a 0.62-fold [95% CI: 0.44-0.86] lower odds of having a high alcohol consumption. More study time = less alcohol (as hypothesised)

Family relationship. For each point (from 1 to 5, going from very bad to excellent) of self reported family relationship, the students have a 0.66-fold [95% CI: 0.50-0.86] lower odds of having a high alcohol consumption. Better family relationships = less alcohol (as hypothesised)
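To make the odds ratios more concrete, we can compare predicted probabilities for two hypothetical students who differ only in goout (a sketch; the covariate values below are made-up assumptions, and model is the fit from above):

```r
# Two hypothetical students: identical except for goout (2 vs 4)
newdata <- data.frame(goout = c(2, 4), sex = "M", studytime = 2, famrel = 4)
predict(model, newdata = newdata, type = "response")
# The goout = 4 student should get a clearly higher predicted probability
```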

6: Cross tabulation

# Creating the predictions
tb_alc_pred <- 
  tb_alc %>%
  mutate(
    probability = predict(model,type="response"),
    prediction = rbinom(n=n(),1,probability)
  )

# Making the cross tabulation
crosstab <- tb_alc_pred %>%
  group_by(high_use) %>%
  summarise(
    n_wrong = sum(high_use!=prediction),
    n_right = sum(high_use==prediction)
  )

# Calculating the percentage wrong predictions
p_wrong = sum( tb_alc_pred$high_use != tb_alc_pred$prediction ) / nrow(tb_alc_pred) * 100
high_use   n_wrong   n_right
FALSE           65       194
TRUE            63        48

The table above shows us the cross tabulation of our predictions vs the training data. We can see that for students who did not have a high alcohol usage (“FALSE”), the model predicted 194 of them correctly, and 65 wrong. For those with a high alcohol usage (“TRUE”), our model predicted 48 right and 63 wrong. Our model thus seems more accurate for those who do not have a high alcohol usage. Going by our calculation above (p_wrong), the total percentage of incorrectly guessed students was 34.59%.
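From the same predictions we could also derive sensitivity (the share of true high users that were caught) and specificity (the share of non-high users correctly identified); a sketch, assuming tb_alc_pred from above:

```r
tb_alc_pred %>%
  summarise(
    sensitivity = sum(prediction == 1 & high_use) / sum(high_use),
    specificity = sum(prediction == 0 & !high_use) / sum(!high_use)
  )
```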

We can also plot this:

# Plotting the predictions
ggplot(tb_alc_pred)+
  aes(x=probability,y=high_use,col=factor(prediction))+
  geom_point(alpha=0.5)

The plot shows each student as one dot; the upper line is those with a high alcohol usage, and the lower line is those without a high alcohol usage. Red points are those predicted to not have a high usage, and blue points are those predicted to have a high usage. The x-axis shows the estimated probability of high usage taken from the model.

As for the comparison with a simple guessing strategy; I found the question unclear and decided not to spend time on it. Leaving this part blank here.

7: Cross validation

# Setting up the loss function (same as in the exercise)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# Running a 10-fold cross validation
library(boot)
cv <- cv.glm(data = tb_alc, cost = loss_func, glmfit = model, K = 10)
cv$delta[1]  # the cross-validated prediction error

Our cross validation tells us that our model had an error of 0.2459459, which is actually better than the one used in the exercise. Looks like we already found a model with better predictive accuracy.


Chapter 4. Boston data

2) Load and explore

library(tidyverse)
library(GGally)
library(ggord)

tb_boston <- MASS::Boston
str(tb_boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

In the Boston dataset, each row is one town/suburb (total N=506) of the city of Boston, Massachusetts. The dataset contains various demographic, land-use, environmental, and infrastructure data. Full description. Some examples:

- crim: the per capita crime rate in the area
- nox: the nitrogen oxides concentration (parts per 10 million)
- rad: index of accessibility to radial highways
- ptratio: pupil-teacher ratio by town

etc…

3) Graphical overview

color_correlation <- function(data, mapping, method="p", use="pairwise", ...){
    # Function by user20650 on Stackoverflow (https://stackoverflow.com/a/53685979)
    # grab data
    x <- eval_data_col(data, mapping$x)
    y <- eval_data_col(data, mapping$y)

    # calculate correlation
    corr <- cor(x, y, method=method, use=use)

    # calculate colour based on correlation value
    # Here I have set a correlation of minus one to blue, 
    # zero to white, and one to red 
    # Change this to suit: possibly extend to add as an argument of `color_correlation`
    colFn <- colorRampPalette(c("blue", "white", "red"), interpolate ='spline')
    fill <- colFn(100)[findInterval(corr, seq(-1, 1, length=100))]

    ggally_cor(data = data, mapping = mapping, ...) + 
      theme_void() +
      theme(panel.background = element_rect(fill=fill))
  }


ggpairs(
  data = tb_boston,
  upper = list(continuous = color_correlation),
  lower = list(continuous = wrap("points", alpha = 0.3, size=0.3)),
)

That’s an overwhelming amount of correlating data! It’s almost easier to describe what’s not correlating rather than what is. I’m guessing that the data here must be a selection from some larger data set where only the most interesting/correlating variables have been kept. We can also see that very few of the variables follow simple normal distributions, often having multiple peaks, or being highly skewed either to the left or to the right.

4) Standardizing

# A function for randomizing the order of rows in a table
randomize_rows <- function(tb){
  return(
   tb %>%
     mutate( order=sample(row_number()) ) %>%
     arrange(order) %>%
     select(-order)
  )
}

# Standardizing all variables, 
# Turn crim into a categorical variable, divide by quantiles in ranges of 25%
# - and randomize the order of the rows in the dataset before cutting it off into test- and training sets later
tb_boston_st <- tb_boston %>% 
  mutate_all( ~ scale(.)[,1]) %>%
  mutate(
    crim = cut(crim,quantile(crim, probs=c(0,0.25,0.5,0.75,1)),labels=c("low","med_low","med_high","high"),include.lowest=T)
  ) %>% 
  randomize_rows()

cutoff_80 = round(nrow(tb_boston_st)*0.8)
tb_boston_st_train = tb_boston_st[1:cutoff_80,]
tb_boston_st_test =  tb_boston_st[(cutoff_80+1):nrow(tb_boston_st),]

ggpairs(
  data = tb_boston_st,
  upper = list(continuous = color_correlation),
  lower = list(continuous = wrap("points", alpha = 0.3, size=0.3)),
)

After standardizing all the columns in the dataset, each variable is centered on zero, and its value is expressed in standard deviations. Because of this, it’s not easy to see the changes visually in the figure above; the only place you can really see the change is by looking at the axis values.
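We can also verify the standardization numerically: every scaled column should now have mean ~0 and standard deviation ~1 (a sketch; crim is excluded below since it has been turned into a factor):

```r
tb_boston_st %>%
  summarise(across(where(is.numeric), list(m = mean, s = sd))) %>%
  round(2)
```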

5) Linear discriminant analysis

lindis <- MASS::lda(crim~.,tb_boston_st_train)

print(lindis)
## Call:
## lda(crim ~ ., data = tb_boston_st_train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2493827 0.2567901 0.2493827 0.2444444 
## 
## Group means:
##                   zn      indus         chas        nox         rm        age
## low       0.95848346 -0.9120632 -0.155385496 -0.8801019  0.4360067 -0.8861534
## med_low  -0.05084164 -0.3270113 -0.007331936 -0.5550686 -0.1126280 -0.2880256
## med_high -0.38450487  0.1956216  0.234426408  0.4083171  0.1100684  0.4386633
## high     -0.48724019  1.0149946 -0.033716932  1.0029027 -0.3300043  0.7987064
##                 dis        rad        tax     ptratio       black       lstat
## low       0.8918545 -0.6941859 -0.7348002 -0.40889554  0.37716249 -0.75509392
## med_low   0.3577450 -0.5412574 -0.4289550 -0.07406251  0.34724241 -0.15041474
## med_high -0.3706090 -0.3917183 -0.2922618 -0.24242667  0.07338671 -0.01214478
## high     -0.8519107  1.6596029  1.5294129  0.80577843 -0.67353177  0.81979542
##                  medv
## low       0.493468462
## med_low  -0.008480782
## med_high  0.144671627
## high     -0.612454341
## 
## Coefficients of linear discriminants:
##                 LD1         LD2         LD3
## zn       0.12829236  0.63508634 -0.97236755
## indus   -0.03160969 -0.30313962  0.09422943
## chas    -0.05490313 -0.05442929  0.16113030
## nox      0.26629485 -0.83859719 -1.39150979
## rm      -0.08935639 -0.10749564 -0.21416707
## age      0.37934876 -0.34881102  0.01593860
## dis     -0.10428172 -0.27682460  0.14065211
## rad      3.20556311  0.78754290 -0.29645558
## tax      0.00665294  0.20367841  0.87506554
## ptratio  0.15000802  0.03150162 -0.35185687
## black   -0.16940755  0.01699321  0.17140568
## lstat    0.15158170 -0.14044517  0.38292359
## medv     0.17152741 -0.32651576 -0.17254934
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9475 0.0396 0.0130

Here, the linear discriminant analysis has reduced our dataset down to three dimensions, LD1, LD2, and LD3, where most of the variance is explained by LD1. We can also plot our points against these dimensions:

cols <- c("cadetblue1","green","lightgoldenrod","orange")
ggord(lindis, tb_boston_st_train$crim, axes=c("1","2"), cols=cols )

ggord(lindis, tb_boston_st_train$crim, axes=c("1","3"), cols=cols )

ggord(lindis, tb_boston_st_train$crim, axes=c("2","3"), cols=cols )

In terms of studying the crime in the area, it seems like the first linear discriminant LD1 is the most interesting one, as it is the only one really separating the high-crime areas from the areas with lower crime. It also seems like the variable “rad” is the one most strongly associated with LD1; does high access to highways associate with high crime rates?

Maybe living areas close to the highways are less desirable, and so high-resource residents live further away from highways, and that this demographic associates with lower crime.

6) Prediction

# Making predictions
crim_predicted <- predict(lindis, newdata = tb_boston_st_test %>% select(-crim))

# Creating a cross tabulation
table(correct = tb_boston_st_test$crim, predicted = crim_predicted$class)
##           predicted
## correct    low med_low med_high high
##   low       13      11        2    0
##   med_low    4      14        4    0
##   med_high   0      10       15    0
##   high       0       0        1   27

Seems like our model got a fair number of predictions correct (the diagonal values). But what’s the percentage of correct guesses?

# Create alternative cross tabulation
tb_boston_st_test %>%
  mutate(
    predicted = crim_predicted$class
  ) %>%
  group_by(crim) %>%
  summarise(n=n(), n_pred_correct=sum(crim==predicted), p_pred_correct = n_pred_correct/n*100)
## # A tibble: 4 × 4
##   crim         n n_pred_correct p_pred_correct
##   <fct>    <int>          <int>          <dbl>
## 1 low         26             13           50  
## 2 med_low     22             14           63.6
## 3 med_high    25             15           60  
## 4 high        28             27           96.4

So, our model got about 50-64% correct guesses for the categories low, med_low, and med_high, but quite a lot right for the high-crime areas (96% correct)!
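The per-class percentages can be collapsed into a single overall accuracy; from the cross tabulation above, that is (13+14+15+27)/101 ≈ 68%. A one-liner sketch using the objects from above:

```r
# Share of test rows where the predicted class matches the true class
mean(tb_boston_st_test$crim == crim_predicted$class)
```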

7) Distances and clusters

dist_eu = dist(tb_boston_st, method="euclidean")
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1394  3.5267  4.9081  4.9394  6.2421 13.0045

The median Euclidean distance is 4.9, and the max distance is 13.
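To make concrete what dist() measures, here is a toy example: the Euclidean distance between the rows (0, 0) and (3, 4) is the familiar 3-4-5 triangle:

```r
m <- matrix(c(0, 0,
              3, 4), nrow = 2, byrow = TRUE)
dist(m, method = "euclidean")  # 5
```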

Next, let’s try clustering our data. First, we’ll figure out what number of clusters is sensible to use. In the code below, we compare the total within-cluster sum of squares for different numbers of clusters. However, as an additional detail, I’ve done the same calculations 10 times over and then taken the mean for each choice of cluster number; this reduces the random variation in the graph and smoothens the curve a bit.

v_clusters = rep(1:15,10)
v_wcss <- sapply(v_clusters, function(k){kmeans(tb_boston %>% mutate_all(~scale(.)[,1]), k)$tot.withinss})
tb_kmeans <- data.frame(clusters = v_clusters, wcss=v_wcss) %>%
  group_by(clusters) %>%
  summarise(wcss=mean(wcss))
            
ggplot(tb_kmeans)+aes(x=clusters,y=wcss)+geom_point()

Seems like our biggest drop in total within-cluster sum of squares happens when going from 1 to 2 clusters; further additions of clusters still improve the sum of squares, but not nearly as much.

Let’s group our data by two clusters and show these clusters on the original graphical summary:

kmeans = kmeans(tb_boston %>% mutate_all(~scale(.)[,1]), 2)

tb_boston_kmeans <- tb_boston %>% 
  mutate_all(~scale(.)[,1]) %>%
  mutate(cluster=kmeans$cluster) 

ggpairs(
  data = tb_boston_kmeans,
  mapping = aes(color = factor(cluster)),
  lower = list(continuous = wrap("points", alpha = 0.3, size=0.3))
)

Here we can see visually which points belong to the two clusters, blue and red. What do the clusters associate with? It seems like blue associates with:

  • Higher-than-low crime rates

  • Low amounts of residential land zoned for lots over 25,000 sq ft (few large properties?)

  • High amounts of industry

  • High concentrations of nitrogen oxides

  • Somewhat fewer rooms per dwelling

  • More older buildings

  • Closer distance to employment centres

  • Higher accessibility of highways

  • Higher tax rate (full-value property-tax rate)

  • Lower proportion of blacks

  • Higher proportion of lower-status residents

  • Lower value of homes

Thus, one way to interpret this is that the areas in Boston can be divided into two clusters: one “high-value” cluster with low crime rates, expensive homes, good air quality, little industry, and so on; and one “low-value” cluster with higher crime rates, less expensive homes, worse air quality, and more industry.
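One way to back up this visual reading with numbers is to average each standardized variable within each cluster. A sketch on synthetic data (on the real data, tb_boston_kmeans would take the place of the toy frame):

```r
# Toy sketch: per-cluster means of scaled variables quantify what the
# coloured ggpairs panels show (synthetic values, not the Boston data)
tb_toy <- data.frame(
  crim    = c(-1.0, -0.8, 1.2, 0.9),
  nox     = c(-0.5, -0.7, 0.8, 1.0),
  cluster = c(1, 1, 2, 2)
)
aggregate(cbind(crim, nox) ~ cluster, data = tb_toy, FUN = mean)
```

A positive mean for a scaled variable in a cluster means that cluster sits above the overall average on that variable.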

Bonus

Finally, let’s try separating the dataset into 3 clusters, and then running an LDA on it:

set.seed(123)
kmeans = kmeans(tb_boston %>% mutate_all(~scale(.)[,1]), 3)

tb_boston_kmeans_3 <- tb_boston %>% 
  mutate_all(~scale(.)[,1]) %>%
  mutate(cluster=kmeans$cluster) 

lindis_clustered <- MASS::lda(cluster~.,tb_boston_kmeans_3)

lindis_clustered
## Call:
## lda(cluster ~ ., data = tb_boston_kmeans_3)
## 
## Prior probabilities of groups:
##         1         2         3 
## 0.2806324 0.3992095 0.3201581 
## 
## Group means:
##         crim         zn        indus        chas         nox         rm
## 1  0.9693718 -0.4872402  1.074440092 -0.02279455  1.04197430 -0.4146077
## 2 -0.3549295 -0.4039269  0.009294842  0.11748284  0.01531993 -0.2547135
## 3 -0.4071299  0.9307491 -0.953383032 -0.12651054 -0.93243813  0.6810272
##          age        dis        rad        tax     ptratio      black
## 1  0.7666895 -0.8346743  1.5010821  1.4852884  0.73584205 -0.7605477
## 2  0.3096462 -0.2267757 -0.5759279 -0.4964651 -0.09219308  0.2473725
## 3 -1.0581385  1.0143978 -0.5976310 -0.6828704 -0.53004055  0.3582008
##         lstat       medv
## 1  0.85963373 -0.6874933
## 2  0.09168925 -0.1052456
## 3 -0.86783467  0.7338497
## 
## Coefficients of linear discriminants:
##                 LD1         LD2
## crim     0.03654114  0.20373943
## zn      -0.08346821  0.34784463
## indus   -0.32262409 -0.12105014
## chas    -0.04761479 -0.13327215
## nox     -0.13026254  0.15610984
## rm       0.13267423  0.44058946
## age     -0.11936644 -0.84880847
## dis      0.23454618  0.58819732
## rad     -1.96894437  0.57933028
## tax     -1.10861600  0.53984421
## ptratio -0.13087741 -0.02004405
## black    0.15432491 -0.06106305
## lstat   -0.14002173  0.14786473
## medv     0.02559139  0.37307811
## 
## Proportion of trace:
##    LD1    LD2 
## 0.8999 0.1001

Here we can see that the linear discriminant analysis separated our data along two dimensions, where LD1 is the most important one. The most important variables within LD1 seem to be rad (-1.97), tax (-1.11), and indus (-0.32).

Let’s have a look at this graphically:

ggord(lindis_clustered, factor(tb_boston_kmeans_3$cluster))

This shows us more or less the same. Cluster 1 in particular separates nicely in the LDA, with taxation and access to highways being the variables that most strongly separate these areas from the others.


Chapter 5 HDI data and Principal component analysis.

1) Data overview

1: Show a graphical overview of the data and show summaries of the variables in the data. Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them. (0-3 points)

library(tidyverse)
library(GGally)
library(ggfortify)
tb_human <- read_delim("data/data-human-v2.txt")

# Note: Here, I'm using a version of the human dataset without countries as row names.


color_correlation <- function(data, mapping, method="p", use="pairwise", ...){
    # Function by user20650 on Stackoverflow (https://stackoverflow.com/a/53685979)
    # for ggpairs function, adding color by correlation  
    # grab data
    x <- eval_data_col(data, mapping$x)
    y <- eval_data_col(data, mapping$y)

    # calculate correlation
    corr <- cor(x, y, method=method, use=use)

    # calculate colour based on correlation value
    # Here I have set a correlation of minus one to blue, 
    # zero to white, and one to red 
    # Change this to suit: possibly extend to add as an argument of `my_fn`
    colFn <- colorRampPalette(c("blue", "white", "red"), interpolate ='spline')
    fill <- colFn(100)[findInterval(corr, seq(-1, 1, length=100))]

    ggally_cor(data = data, mapping = mapping, ...) + 
      theme_void() +
      theme(panel.background = element_rect(fill=fill))
  }


tb_human %>%  
 select(-Country) %>%
 pivot_longer(everything()) %>% 
 group_by(name) %>% 
 summarise(mean=round(mean(value),2), sd=round(sd(value),2))
## # A tibble: 8 × 3
##   name          mean       sd
##   <chr>        <dbl>    <dbl>
## 1 Ado.Birth    47.2     41.1 
## 2 Edu.Exp      13.2      2.84
## 3 Edu2.FM       0.85     0.24
## 4 GNI       17628.   18544.  
## 5 Labo.FM       0.71     0.2 
## 6 Life.Exp     71.6      8.33
## 7 Mat.Mor     149.     212.  
## 8 Parli.F      20.9     11.5
ggpairs(
  # Adjusting GNI to be expressed in thousands, to make the graph easier to read
  data = tb_human %>% mutate(GNI = GNI / 1000) %>% rename(GNI_K=GNI) %>% select(-Country),
  # Coloring by correlation
  upper = list(continuous = color_correlation),
  lower = list(continuous = wrap("points", alpha = 0.3, size=0.3)),
) + 
  theme(axis.text = element_text(size=5))

Summaries of the different variables:

  • Edu2.FM: The mean female-to-male ratio of individuals with at least some secondary education is 0.85, indicating that in most countries more males than females have secondary education. There’s quite a lot of spread here: many countries have an even more male-favoring education system, a very few have a female-favoring system, and quite a few have equal numbers of males and females with secondary education.

  • Labo.FM: Very few (maybe 4) countries have a female-biased workforce; most countries are strongly male-biased (the mean ratio is 0.71), and quite a few countries have very low female participation in the workforce.

  • Edu.Exp: The average expected time of education is 13.18 years. This variable is fairly bell-shaped, with some countries having longer and shorter expected time spans (SD = 2.84 years).

  • Life.Exp: Average life expectancy is 71.6 years, with a left-skewed distribution (more countries with long lifespans than with short ones). However, some countries have very low expected lifespans, going down to some ~50 years.

  • GNI: The average gross national income (per capita) is 17 627.90. The distribution here is strongly right-skewed, meaning some countries have vastly higher income (up to some ~125 000 per capita).

  • Mat.Mor: The mean maternal mortality rate is 149 deaths per 100 000 births, but the distribution is highly right-skewed, meaning some countries have a much higher maternal mortality rate (up to almost 1 000 deaths per 100 000 births).

  • Ado.Birth: The mean rate of births among adolescents is 47.16 births per 1 000 women aged 15–19. This measure is also fairly right-skewed, with some countries having much higher rates of adolescent births (up to 200 per 1 000).

  • Parli.F: The average share of seats in parliament held by females is around 20%. The distribution is quite spread out, with some countries at 0% and some with up to 60% of parliament seats held by female politicians (SD = 11.5%).

Notable correlations:

  • There is a positive correlation between expected years of education, the F/M ratio in higher education, life expectancy, and GNI. People living in high-income countries have longer expected educations, more women in higher education, and longer life expectancy.

  • On the other hand, maternal mortality rates and adolescent birth rates correlate negatively with these measures.

  • Finally, the share of females in parliament and the share of females in the workforce don’t correlate strongly with any of these variables. The correlations they do have, however, are somewhat counterintuitive: countries with high female participation in the workforce also have higher maternal mortality.
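The numbers behind the coloured ggpairs panels are ordinary Pearson correlations, which cor() returns directly. A toy sketch (synthetic columns, not the human data):

```r
# Toy sketch: pairwise correlations like those colouring the ggpairs
# panels (two synthetic, strongly correlated columns for illustration)
tb_toy <- data.frame(gni = c(1, 2, 3, 4), life_exp = c(2.1, 3.9, 6.2, 7.8))
round(cor(tb_toy), 2)
```

On the real data, the full matrix could be printed with tb_human %>% select(-Country) %>% cor() %>% round(2).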

2) PCA

2: Perform principal component analysis (PCA) on the raw (non-standardized) human data. Show the variability captured by the principal components. Draw a biplot displaying the observations by the first two principal components (PC1 coordinate in x-axis, PC2 coordinate in y-axis), along with arrows representing the original variables. (0-2 points)

First, running the PCA analysis and printing out the principal components:

# Running PCA
pca_human <- prcomp(tb_human %>% select(-Country))
pca_human
## Standard deviations (1, .., p=8):
## [1] 1.854416e+04 1.855219e+02 2.518701e+01 1.145441e+01 3.766241e+00
## [6] 1.565912e+00 1.912052e-01 1.591112e-01
## 
## Rotation (n x k) = (8 x 8):
##                     PC1           PC2           PC3           PC4           PC5
## Edu2.FM   -5.607472e-06  0.0006713951 -3.412027e-05 -2.736326e-04 -0.0022935252
## Labo.FM    2.331945e-07 -0.0002819357  5.302884e-04 -4.692578e-03  0.0022190154
## Edu.Exp   -9.562910e-05  0.0075529759  1.427664e-02 -3.313505e-02  0.1431180282
## Life.Exp  -2.815823e-04  0.0283150248  1.294971e-02 -6.752684e-02  0.9865644425
## GNI       -9.999832e-01 -0.0057723054 -5.156742e-04  4.932889e-05 -0.0001135863
## Mat.Mor    5.655734e-03 -0.9916320120  1.260302e-01 -6.100534e-03  0.0266373214
## Ado.Birth  1.233961e-03 -0.1255502723 -9.918113e-01  5.301595e-03  0.0188618600
## Parli.F   -5.526460e-05  0.0032317269 -7.398331e-03 -9.971232e-01 -0.0716401914
##                     PC6           PC7           PC8
## Edu2.FM    2.180183e-02  6.998623e-01  7.139410e-01
## Labo.FM    3.264423e-02  7.132267e-01 -7.001533e-01
## Edu.Exp    9.882477e-01 -3.826887e-02  7.776451e-03
## Life.Exp  -1.453515e-01  5.380452e-03  2.281723e-03
## GNI       -2.711698e-05 -8.075191e-07 -1.176762e-06
## Mat.Mor    1.695203e-03  1.355518e-04  8.371934e-04
## Ado.Birth  1.273198e-02 -8.641234e-05 -1.707885e-04
## Parli.F   -2.309896e-02 -2.642548e-03  2.680113e-03

Let’s look at how much variation is explained by different components:

# Displaying variability captured by different principal components
summary(pca_human)
## Importance of components:
##                              PC1      PC2   PC3   PC4   PC5   PC6    PC7    PC8
## Standard deviation     1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912 0.1591
## Proportion of Variance 9.999e-01   0.0001  0.00  0.00 0.000 0.000 0.0000 0.0000
## Cumulative Proportion  9.999e-01   1.0000  1.00  1.00 1.000 1.000 1.0000 1.0000

Note how PC1 captures almost all of the variance, and that GNI is by far the most heavily weighted variable in PC1.

Let’s draw the biplot:

autoplot(pca_human, data = tb_human, loadings.label = TRUE, loadings.label.size = 3)

PCA biplot of the HDI data (unstandardized). The plot indicates that almost all of the variation in our dataset is captured by principal component 1, which is strongly associated with gross national income per capita. Principal component 2 is far less important (it explains 0.01% of the variation); here the maternal mortality rate is the most strongly associated variable.

Interesting; we’ll discuss more below.

3) PCA post standardizing

3: Standardize the variables in the human data and repeat the above analysis. Interpret the results of both analysis (with and without standardizing). Are the results different? Why or why not? Include captions (brief descriptions) in your plots where you describe the results by using not just your variable names, but the actual phenomena they relate to. (0-4 points)

tb_human_st <- 
  tb_human %>% 
  mutate_at(vars(-"Country"), 
    ~ scale(.)[,1]
    )

pca_human_st <- prcomp(tb_human_st %>% select(-Country))
pca_human_st
## Standard deviations (1, .., p=8):
## [1] 2.0708380 1.1397204 0.8750485 0.7788630 0.6619563 0.5363061 0.4589994
## [8] 0.3222406
## 
## Rotation (n x k) = (8 x 8):
##                   PC1         PC2         PC3         PC4        PC5
## Edu2.FM   -0.35664370  0.03796058 -0.24223089  0.62678110 -0.5983585
## Labo.FM    0.05457785  0.72432726 -0.58428770  0.06199424  0.2625067
## Edu.Exp   -0.42766720  0.13940571 -0.07340270 -0.07020294  0.1659678
## Life.Exp  -0.44372240 -0.02530473  0.10991305 -0.05834819  0.1628935
## GNI       -0.35048295  0.05060876 -0.20168779 -0.72727675 -0.4950306
## Mat.Mor    0.43697098  0.14508727 -0.12522539 -0.25170614 -0.1800657
## Ado.Birth  0.41126010  0.07708468  0.01968243  0.04986763 -0.4672068
## Parli.F   -0.08438558  0.65136866  0.72506309  0.01396293 -0.1523699
##                   PC6         PC7         PC8
## Edu2.FM    0.17713316  0.05773644  0.16459453
## Labo.FM   -0.03500707 -0.22729927 -0.07304568
## Edu.Exp   -0.38606919  0.77962966 -0.05415984
## Life.Exp  -0.42242796 -0.43406432  0.62737008
## GNI        0.11120305 -0.13711838 -0.16961173
## Mat.Mor    0.17370039  0.35380306  0.72193946
## Ado.Birth -0.76056557 -0.06897064 -0.14335186
## Parli.F    0.13749772  0.00568387 -0.02306476
summary(pca_human_st)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.0708 1.1397 0.87505 0.77886 0.66196 0.53631 0.45900
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595 0.02634
## Cumulative Proportion  0.5361 0.6984 0.79413 0.86996 0.92473 0.96069 0.98702
##                            PC8
## Standard deviation     0.32224
## Proportion of Variance 0.01298
## Cumulative Proportion  1.00000

Now this changes things! PC1 is still the most important, but not by nearly as much as previously.

autoplot(pca_human_st, x=1,y=2, data = tb_human, loadings.label = TRUE, loadings.label.size = 3)

Biplot of the HDI data (standardized). Here, the data separates along two axes. PC1, which explains the most variation (53.6%), relates mainly to economic and health-care development (life expectancy, gross national income per capita, expected years of education, maternal mortality rates, and adolescent birth rates, but also the proportion of females in higher education). The other axis (PC2) explains less variation (16.2%) and relates to female participation in politics and in the workforce.

The results are obviously very different. The reason is that the variables are now expressed relative to their own variation, not their absolute size. Previously, GNI was the most important variable partly because it held the largest values and the most absolute variation (the same goes for maternal mortality); after rescaling it so that it is expressed in standard deviations, it is analysed on the same basis as the other variables. Thus, after standardizing, we’re analysing the dataset based on the relative variation in the selected variables, not the absolute variation.
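As a side note, the manual scaling step isn’t strictly needed: prcomp() can standardize internally via scale. = TRUE, and both routes give the same standardized PCA. A quick check on a built-in dataset (mtcars, not the human data):

```r
# Standardizing by hand vs letting prcomp do it: both give the same
# standardized PCA (demonstrated on the built-in mtcars data)
pca_manual   <- prcomp(scale(mtcars))
pca_internal <- prcomp(mtcars, scale. = TRUE)
all.equal(pca_manual$sdev, pca_internal$sdev)  # TRUE
```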

4) Interpretations

4: Give your personal interpretations of the first two principal component dimensions based on the biplot drawn after PCA on the standardized human data. (0-2 points)

We can see that the data divides mainly along two axes: PC1 seems to include variables related to overall economic and health-care development, such as GNI, life expectancy, maternal mortality rates, and adolescent birth rates. The other axis contains variables related to female participation in the workforce and in politics. It is interesting that these two axes separate quite strongly, almost as if economic development and gender equality in the workforce and in politics were independent. On that note, it is also interesting that gender equality in education lies on the same axis as economic development (PC1), not on the “gender-participation” axis (PC2).

5) Tea data

5: The tea data comes from the FactoMineR package and it is measured with a questionnaire on tea: 300 individuals were asked how they drink tea (18 questions) and what are their product’s perception (12 questions). In addition, some personal details were asked (4 questions). Explore the data briefly: look at the structure and the dimensions of the data. Use View(tea) to browse its contents, and visualize the data. Use Multiple Correspondence Analysis (MCA) on the tea data (or on just certain columns of the data, it is up to you!). Interpret the results of the MCA and draw at least the variable biplot of the analysis. You can also explore other plotting options for MCA. Comment on the output of the plots. (0-4 points)

library(FactoMineR)
## Warning: package 'FactoMineR' was built under R version 4.2.2
tb_tea <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE)

The dataset contains 300 observations (people), with columns representing how they answered various questions. Almost all variables are factors with fairly few levels, often only two; age is the only numeric variable. The dataset contains data about how people drink tea and how often, together with some questions about personal details and about how they feel about tea drinking.

For the multiple correspondence analysis, we’re only going to use the first 18 columns as active variables (with age and the remaining background columns as supplementary), as they seem to be the most important ones for traits related to tea drinking itself.

mca_tea<-  FactoMineR::MCA(tb_tea, quanti.sup=19, quali.sup=c(20:36), graph=FALSE)


# visualize MCA
plot(mca_tea, invisible=c("ind","quali.sup","quanti.sup"), graph.type = "classic",cex=0.7)

plot(mca_tea, invisible=c("var","quali.sup","quanti.sup"), graph.type = "classic",cex=0.7)

The first plot shows us how the different answers relate to the two main dimensions, while the second plot shows us where the individuals fall on these two dimensions. We can see that the population is fairly homogeneous; there aren’t any obvious groupings among the respondents. On the first graph, we can try to make sense of what the two dimensions associate with. On the upper extreme of Dim 1 we see answers such as “tearoom”, “chainstore+teashop”, “tea bag+unpackaged”, “always”, “pub”, “lunch”, and “work”, while on the lower extreme we have answers like “not.tea time”, “not work”, “p_cheap”, “not.home”, “not.tearoom”, etc. One way to interpret this is that dimension 1 separates frequent tea drinkers from those who drink tea rarely. The second dimension could relate to tea-buying patterns: those on the higher end of Dim 2 like to buy more expensive, unpackaged teas from specialized tea shops, while those on the lower end buy their tea from chain stores, use tea bags, take milk in their tea, and drink Earl Grey.


Assignment 6. Analysis of longitudinal data

1) Analysis of RATS data via summary statistics

The RATS data set contains data from 16 subjects (rats), which were measured on 11 days during a 9-week-long experiment. The rats were divided among 3 feeding treatments, and we’re looking to see whether growth rate differs between the treatments.

First, loading in the data (long form):

library(tidyverse)

tb_RATS <- read_delim(
  file = "data/data-rats.txt",
  col_types = list(ID = col_factor(), Group = col_factor())
  )

Graphical overviews

tb_RATS %>% 
  ggplot()+
  aes(x=day, y=weight, label=ID, col=Group)+
  geom_text()+
  labs(title="A) Weight of rats over the study time period", subtitle = "Colors:groups, Numbers: individual ID")

tb_RATS %>% 
  ggplot()+
  aes(x=day, y=weight, col=ID, facet=Group)+
  geom_line(size=1,alpha=0.6)+
  facet_wrap(~Group) +
  labs(title="B) Weight of rats over the study time period", subtitle = "Facets indicate groups 1-3, each line is one rat")


tb_RATS %>%
  ggplot()+
  aes(x=day,y=weight,col=Group, shape=Group)+
  stat_summary(fun.data=mean_se, geom="errorbar", position = position_dodge(2), width=3,size=0.9)+
  stat_summary(fun=mean, geom="line", position = position_dodge(2), size=1, alpha=0.2)+
  stat_summary(fun=mean, geom="point", position = position_dodge(2), size=3)+
  labs(title="C) Mean weight (±SE)of rats in each group (1-3) over the study time period", 
       subtitle = "Points are dodged slightly to avoid overlap",
       y="Mean weight (±SE)")

Here I’ve used a few different plotting approaches to look at how the weight of the subjects (rats) changes over the study period.

  • A) Looks a little messy: here, each weight measurement of each individual is plotted against time, and the individuals are identified by their numerical ID. This gives us an idea of what the data looks like, but it’s hard to tell some individuals apart.

  • B) Here, I’ve used line plots to connect the measurements of each individual so we can more easily see the changes over time. The three facets (subplots) are for the three feed groups. We can clearly see there is some difference between the groups, we can track each individual, and we can see that group 2 has one individual that is much larger than the others in that group.

  • C) Here, we’re using summarized data for each day and treatment. We can see the mean weight of each group on each measurement day, with standard errors to give us an idea of how different the groups are. Here we’re hiding the original datapoints, but showing them would make the plot very messy. In any case, we can see that the weights increase over time in all the groups. However, it’s hard to tell whether there is a difference in growth rate, which is what we’re interested in finding out from this data.

Grouped summaries and plots

tb_RATS %>% 
  ggplot()+
  aes(x=Group, y=weight)+
  geom_boxplot()+
  labs(title="A) Mean weights of rats in each treatment group, for the entire study period")

tb_RATS %>% 
  ggplot()+
  aes(x=Group, y=weight)+
  geom_violin(alpha=0.3)+
  geom_jitter(aes(col=ID))+
  labs(title="B) Distributions of weights of rats in each treatment group (violins)",
       subtitle="Points indicate single measures of individuals, colors indicate individual")

Here are some more summary plots, similar to those used in Chapter 8 of the MABS book. Each plot simply shows the weights recorded in each treatment (over the entire study period). These plots really are not that useful for figuring out whether growth rate differs between the treatments, but they do show the absolute weight differences.

One interesting thing to note: in plot A it looks like we have some outliers (group 2). I decided to investigate this further and made an alternative plot (B) showing the original data points, with violin plots in the background showing their distribution. Here we can see that while some points look like outliers when pooled together, within their individual they are not outliers at all; they fall within a reasonable range for the purple individual in group 2. I think this underscores the importance of accounting for individual ID when working with this kind of data. Another way to make this plot would be to use each individual’s mean weight over the study as a single data point instead of every single measurement; still, that’s throwing away good data!
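The per-individual-mean alternative mentioned above would look something like this sketch (synthetic weights, not the actual RATS data):

```r
# Toy sketch: collapsing each rat to its mean weight over the study
# before comparing groups (values are made up for illustration)
tb_toy <- data.frame(
  ID     = rep(c("1", "2", "3"), each = 3),
  Group  = rep(c("1", "1", "2"), each = 3),
  weight = c(240, 250, 260, 255, 260, 270, 440, 455, 470)
)
aggregate(weight ~ ID + Group, data = tb_toy, FUN = mean)
```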

An attempt to do a decent summary of growth

tb_RATS %>%
  group_by(ID,Group) %>%
  summarise(slope = lm(weight~day)$coefficients[2]) %>%
  ggplot()+
  aes(x=Group,y=slope)+
  geom_violin(alpha=0.5, color=NA, fill="gray")+
  geom_jitter(width=0.1,height=0)+
  stat_summary(fun.data=mean_se, geom="errorbar",width=0.3,size=1,alpha=0.3)+
  stat_summary(fun=mean, geom="point",size=5,alpha=0.3)+
  labs(
    title="Difference in weight gain for different groups of rats",
    subtitle="Large points and errorbars indicate group means ± SE",
    y="Estimated weight gain (g/day)"
  )

The previous graphs were not very useful for telling whether there is a difference in growth rate between the groups; we need some measure of growth rate. One way could have been to take the difference between the first and last data points and use that as the growth rate, but then we’d be throwing away all the data in between; not very exciting.

Really, the proper way to do this would be a linear mixed-effect model, but I’m not supposed to use one in this part of the assignment, so I came up with another “hacky” solution: for each individual, I fit a small linear model to their growth data and extracted the slope. Thus, each data point in the graph above is one individual’s slope, and we can compare the slopes between the three groups by looking at the means and standard errors (though I’ve also plotted the original data points and their distributions). What we see is that group 1 definitely looks to have a lower growth rate than groups 2 and 3, but I wouldn’t be so sure about the difference between groups 2 and 3.

Keep in mind that this is not really a good solution, because we’re essentially doing analysis on parameters (the slopes we calculated); having calculated the means and standard errors of these, we’re calculating parameters of parameters! We’ll almost certainly violate some assumptions when we put these slope data into any statistical test.

…so let’s do that next!

Anova analysis

model_fit <- 
  tb_RATS %>%
  group_by(ID,Group) %>%
  summarise(slope = lm(weight~day)$coefficients[2]) %>%
  lm(slope ~ Group, data=.) 

anova(model_fit)
## Analysis of Variance Table
## 
## Response: slope
##           Df  Sum Sq Mean Sq F value   Pr(>F)   
## Group      2 1.00665 0.50332  7.5743 0.006594 **
## Residuals 13 0.86387 0.06645                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In Chapter 8 of the MABS book, they do both a t-test and an ANOVA. However, we’re dealing with three groups here (not two), so a t-test is out of the question (we could do multiple t-tests, but then we’d also have to correct for multiple testing). Instead, I’m only doing an ANOVA on the slopes I calculated. The summary tells us that there is a significant difference in slopes between the groups, but not which group differs from which. This is about as far as we get without turning to proper linear mixed models.
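To probe which groups differ, one option (beyond what the assignment asks for) would be Tukey’s HSD on the same slope-per-rat data, which handles the multiple-comparison correction for us. A sketch with made-up slopes:

```r
# Toy sketch: pairwise group comparisons with Tukey's HSD
# (slope values are made up, not the real per-rat slopes)
tb_toy <- data.frame(
  slope = c(0.50, 0.60, 0.55, 1.10, 1.20, 1.15, 1.00, 1.05, 1.10),
  Group = factor(rep(c("1", "2", "3"), each = 3))
)
TukeyHSD(aov(slope ~ Group, data = tb_toy))
```

On the real data, the slope tibble computed above would take the place of tb_toy.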

2) Analysis of BPRS data via mixed-effect models

This dataset contains data for 40 subjects who were given BPRS scores (a psychological rating scale indicating schizophrenia symptoms) over 8 weeks, and who were put in one of two treatment groups.

Loading in the data:

library(lmerTest)

tb_BPRS <- read_delim(
  file = "data/data-BPRS.txt",
  col_types = list(subject = col_factor(), treatment = col_factor())
  )

Graphical overview

tb_BPRS %>% 
  ggplot()+
  aes(x=week, y=BPRS, col=subject, facet=treatment)+
  geom_line()+
  facet_wrap(~treatment)+
  labs(title="A) BPRS scores of subjects over the study time period")

tb_BPRS %>%
  group_by(week) %>%
  mutate(BPRS_scaled = scale(BPRS)) %>%
  ggplot()+
  aes(x=week, y=BPRS_scaled, col=subject, facet=treatment)+
  geom_line()+
  facet_wrap(~treatment)+
  labs(title="B) Scaled BPRS scores of subjects over the study time period")

The two plots above show how the BPRS scores of all participating individuals developed over the 8 weeks. Plot B) is supposed to be easier to read because the scores are scaled (within each week), but I honestly don’t find it much more helpful than the first one. In any case, it seems that in both treatments the BPRS scores go down over time, but we can’t really tell by how much, or whether there is a difference between the groups.

To tell that, we’ll need to do some modelling.

Fitting some models:

Note: To save space, for the following summaries I’m only printing the model coefficients.

As in MABS Ch9 I’ll fit some linear models (and linear mixed models) of increasing complexity, and then I’ll compare them in the end.

First, fitting a completely ordinary linear model:

mod_1_normal <- lm( BPRS ~ week + treatment, data=tb_BPRS)
summary(mod_1_normal)$coefficients
##               Estimate Std. Error    t value      Pr(>|t|)
## (Intercept) 46.4538889  1.3670163 33.9819579 6.256351e-114
## week        -2.2704167  0.2524021 -8.9952367  1.412930e-17
## treatment2   0.5722222  1.3033989  0.4390231  6.609104e-01

According to this first model, the BPRS scores go down significantly each week (by 2.27 points per week), while the effect of treatment is more uncertain (treatment 2 might give higher BPRS scores than treatment 1, but the difference is not significant, so we can’t tell).

Next, let’s try fitting a mixed effect (ME) model, letting subject ID act as a random effect on the intercept:

mod_2_rInt <- lmer( BPRS ~ week + treatment + (1|subject), data=tb_BPRS)
summary(mod_2_rInt)$coefficients
##               Estimate Std. Error        df     t value     Pr(>|t|)
## (Intercept) 46.4538889  2.4095150  43.20949  19.2793521 7.596500e-23
## week        -2.2704167  0.1505526 319.00000 -15.0805494 3.605128e-39
## treatment2   0.5722222  3.2994257  38.00000   0.1734309 8.632333e-01

Using this model gives us a very similar result. The significance of the effect of time increased, but we’re still uncertain about the effect of treatment 2.

Next: increasing the complexity. The same model as above, but now subjects have a random effect on both the intercept and the effect of time (week).

mod_3_rInt_rSlope <- lmer( BPRS ~ week + treatment + (week|subject), data=tb_BPRS)
summary(mod_3_rInt_rSlope)$coefficients
##              Estimate Std. Error       df    t value     Pr(>|t|)
## (Intercept) 45.982973  2.7054200 47.78312 16.9966118 6.816057e-22
## week        -2.270417  0.2747198 38.99793 -8.2644820 4.216588e-10
## treatment2   1.514053  3.2207855 37.99912  0.4700881 6.409789e-01

Still a similar result. The effect size of the treatment is higher now, yet still non-significant.

Finally, same model as above, but now with an interaction fitted between week and treatment:

mod_4_rInt_rSlope_intractn <- lmer( BPRS ~ week * treatment + (week|subject), data=tb_BPRS)
summary(mod_4_rInt_rSlope_intractn)$coefficients
##                   Estimate Std. Error       df    t value     Pr(>|t|)
## (Intercept)     47.8855556  3.0614637 38.00344 15.6413926 3.802479e-18
## week            -2.6283333  0.3849267 38.00052 -6.8281404 4.204415e-08
## treatment2      -2.2911111  4.3295635 38.00344 -0.5291783 5.997594e-01
## week:treatment2  0.7158333  0.5443685 38.00052  1.3149793 1.963973e-01

The effect of time (week) is still very similar here (and significant), but the treatment effect has changed a lot! Treatment 2 now gives a lower baseline BPRS score (-2.29) than treatment 1. Also, the effect of time (week) now differs between the treatments: in treatment 1, subjects go down in BPRS faster than in treatment 2. However, both the treatment effect and the interaction are still non-significant here.

Let’s compare the models using an ANOVA analysis:

anova(
  mod_2_rInt,
  mod_3_rInt_rSlope,
  mod_4_rInt_rSlope_intractn
)
## Data: tb_BPRS
## Models:
## mod_2_rInt: BPRS ~ week + treatment + (1 | subject)
## mod_3_rInt_rSlope: BPRS ~ week + treatment + (week | subject)
## mod_4_rInt_rSlope_intractn: BPRS ~ week * treatment + (week | subject)
##                            npar    AIC    BIC  logLik deviance  Chisq Df
## mod_2_rInt                    5 2582.9 2602.3 -1286.5   2572.9          
## mod_3_rInt_rSlope             7 2523.2 2550.4 -1254.6   2509.2 63.663  2
## mod_4_rInt_rSlope_intractn    8 2523.5 2554.6 -1253.7   2507.5  1.780  1
##                            Pr(>Chisq)    
## mod_2_rInt                               
## mod_3_rInt_rSlope           1.499e-14 ***
## mod_4_rInt_rSlope_intractn     0.1821    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The analysis tells us that model 3 (subject as a random effect on both the intercept and the slope of time, but no interaction) fits significantly better than model 2, while model 4 does not improve significantly on model 3. (The likelihood here is the probability of observing our data if the model were true; higher is better.) This indicates model 3 could be the best model.
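
As a sanity check, the chi-square test for model 3 vs model 2 can be reproduced by hand from the deviances in the ANOVA table (deviance = -2 × logLik). This is just a sketch using the printed values:

```r
# Likelihood-ratio test of model 3 vs model 2, from the ANOVA table above
chisq <- 2572.9 - 2509.2   # difference in deviance
df    <- 7 - 5             # difference in number of parameters (npar)
p     <- pchisq(chisq, df = df, lower.tail = FALSE)
round(chisq, 1)
# 63.7
# p is roughly 1.5e-14, in line with the Pr(>Chisq) column above
```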

Let’s also try some cross validation to compare the models. The following code splits the data set into two folds (even/odd subject IDs), fits each model to one fold at a time (giving two fits per model), and uses the parameters from each fit to predict the held-out fold. It then repeats this procedure for each model:

set.seed(1337)
library(cvms)
# Cross validation (split based on batch number)
cvms::cross_validate(
  data = tb_BPRS %>% 
    mutate(
      id_numeric = as.numeric(substr(subject,2,4)),
      .folds=factor(ifelse((id_numeric %% 2) == 0,yes=1,no=2))
      ),
  models = c(
    "BPRS ~ week + treatment",
    "BPRS ~ week + treatment + (1|subject)",
    "BPRS ~ week + treatment + (week|subject)",
    "BPRS ~ week * treatment + (week|subject)"
    ),
  family="gaussian",
  REML=TRUE
  ) %>% select(Fixed,Random,RMSE,AICc)
## # A tibble: 4 × 4
##   Fixed          Random          RMSE  AICc
##   <chr>          <chr>          <dbl> <dbl>
## 1 week+treatment <NA>            13.8 1396.
## 2 week+treatment (1|subject)     13.8 1272.
## 3 week+treatment (week|subject)  13.9 1245.
## 4 week*treatment (week|subject)  13.8 1245.

The cross validation does not quite agree with the ANOVA analysis, but it’s close. The Root Mean Square Error (RMSE, a measure of how close the predictions were to the observed data; smaller is better) is best for model 4, so model 4 does marginally better on cross validation. The AICc (not related to cross validation, but a measure of model fit that penalizes complexity) is essentially tied between models 3 and 4 at the precision shown. However, the differences are very small.
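
For reference, the RMSE that cvms reports is just the root of the mean squared difference between observed and predicted values. A toy sketch (the numbers here are made up for illustration, not taken from the BPRS data):

```r
# RMSE: root mean squared error between observations and predictions
obs  <- c(40, 35, 30, 28)   # hypothetical observed BPRS values
pred <- c(38, 36, 27, 30)   # hypothetical held-out predictions
rmse <- sqrt(mean((obs - pred)^2))
round(rmse, 2)
# 2.12
```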

The safest conclusion is probably that time in treatment had an effect on BPRS scores (models 3 and 4 both agree on this), and that there may be a difference between the treatments in how the subjects develop over time, but this effect is uncertain (only model 4 includes a week × treatment interaction, and even there it is non-significant).

Finally, let’s plot the original data vs the predicted data from model 4:

# Mark the observed data, then make a copy holding model 4's predictions
tb_BPRS$fitted <- FALSE

tb_BPRS_fitted <- 
  tb_BPRS %>% 
  mutate(BPRS = fitted(mod_4_rInt_rSlope_intractn), fitted = TRUE)

# Plot observed and predicted trajectories side by side,
# faceted by treatment group and observed/fitted status
ggplot(
  bind_rows(tb_BPRS, tb_BPRS_fitted)
  )+
  aes(x = week, y = BPRS, col = subject)+
  geom_line()+
  facet_wrap(~interaction(treatment, fitted))

The plot above shows the observed data for groups 1 and 2 (top row) and the predicted data for the same groups (bottom row). The main observation is that the predicted data look fairly similar to the observed data, meaning our model is not completely off.

Thanks for grading this assignment, and happy holidays!